(54) Abstract Title

Communication device for endpointing speech utterances 

(57) A communication device capable of endpointing 
speech utterances includes a speech/noise classifier and 
speech recognition technology. A speech signal is 
analysed to determine speech waveform parameters 
within a speech acquisition window 215. The speech 
waveform parameters are compared to determine the start 
and end points of the speech utterance. Processing starts 
at a frame index based on the energy centroid of the 
speech utterance and analyzes the frames preceding and 
following the frame index to determine the endpoints. 
When a potential endpoint is identified, the cumulative
energy is compared to the total energy of the speech
acquisition window to determine whether additional
speech frames are present 255, 280. Accordingly, gaps and
pauses in the utterance will not result in an erroneous 
endpoint determination. 
[Flowchart excerpt: the user activates the speech recognition technology; the user provides speech input; the speech signal is analyzed to determine speech waveform parameters, including the energy centroid, cumulative frame energy, and total window energy.]

population of expected users are averaged in some manner to create a word model for
that word. By averaging speech parameters for the same word spoken by different 
people, the word model should be usable by most if not all people. 

In speaker dependent speech recognition devices, the user trains the device by 
speaking the particular word when prompted by the device. The speech recognition 
technology then creates a word model based on the input from the user. The speech 
recognition technology may prompt the user to repeat the word any number of times 
and then average the speech waveform parameters in some manner to create the word 
model. 

To properly operate speech recognition technology, it is important to consistently 
identify the start and end endpoints of the speech utterances. Inconsistently identified 
endpoints may truncate words and may include extraneous noises within the speech 
waveform acquired by the speech recognition technology. Truncated words and/or 
noises may result in poorly trained models and cause the speech recognition 
technology not to work properly when the acquired speech waveform does not match 
any word model. In addition, truncated words and noises may cause the speech 
recognition technology to misidentify the acquired speech waveform as another word. 
In speaker dependent speech recognition devices, problems due to poor endpointing 
are aggravated when the speech recognition technology permits only a few training 
utterances. 

The prior art describes techniques using threshold energy comparisons, zero
crossing analysis, and cross correlation. These methods sequentially analyze speech
features from left to right, right to left, or center outwards of the speech waveform. In 
these techniques, utterances containing pauses or gaps are problematic. Typically, 
pauses or gaps in an utterance are caused by the nature of the word, the speaking 
style of the user, and by utterances containing multiple words. Some techniques 
truncate the word or phrase at the gap, assuming erroneously that the endpoint has 
been reached. Other techniques use a maximum gap size criterion to combine detected
parts of utterances with pauses into a single utterance. In such techniques, a pause 
longer than a predetermined threshold can cause parts of the utterance to be excluded. 
Brief Description of the Drawings

The present invention is better understood when read in light of the 
accompanying drawings, in which: 

FIG. 1 is a block diagram of a communication device capable of endpointing 
speech utterances; and 

FIG. 2 is a flowchart describing endpointing speech utterances. 

Detailed Description of the Invention 

FIG. 1 is a block diagram of a communication device 100 according to the 
present invention. Communication device 100 may be a cellular telephone, a portable
telephone handset, a two-way radio, a data interface for a computer or personal
organizer, or similar electronic device. Communication device 100 includes 
microprocessor 110 connected to communication interface circuitry 115, memory 120, 
audio circuitry 130, keypad 140, display 150, and vibrator/buzzer 160. 

Microprocessor 110 may be any type of microprocessor including a digital signal
processor or other type of digital computing engine. Preferably, microprocessor 110
includes a speech/noise classifier and speech recognition technology. One or more 
additional microprocessors (not shown) may be used to provide the speech/noise 
classifier, the speech recognition technology, and the endpointing of the present 
invention. 

Communication interface circuitry 115 is connected to microprocessor 110. The
communication interface circuitry is for sending and receiving data. In a cellular
telephone, communication interface circuitry 115 would include a transmitter, receiver,
and an antenna. In a computer, communication interface circuitry 115 would include a 
data link to the central processing unit. 

Memory 120 may be any type of permanent or temporary memory such as
random access memory (RAM), read-only memory (ROM), disk, and other types of



duration and 10 ms are preferred. For each frame, microprocessor 110 determines the
frame energy using the following equation: 



The parameter fegy_n is related to the energy of a frame of sampled data. This
can be the actual frame energy or some function of it. x_i are the speech samples. I is the
number of samples in a data frame, n. N is the total number of frames in the speech 
acquisition window. 
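
The frame energy equation itself is not reproduced in this text. As a minimal sketch, assuming fegy_n is the plain sum of squared samples in frame n (the description allows "the actual frame energy or some function of it"), the per-frame computation might look like the following. The function name `frame_energies` and the choice to drop a trailing partial frame are illustrative assumptions, and the returned list is 0-based while the description numbers frames from 1 through N.

```python
def frame_energies(samples, frame_len):
    """Return an assumed fegy value for each complete frame.

    samples   -- sequence of speech samples (x_i in the description)
    frame_len -- number of samples per frame (I in the description)

    Assumption: frame energy is the sum of squared samples. The source
    says it may instead be some function of the frame energy.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(samples) // frame_len  # N; a trailing partial frame is dropped
    return [
        sum(x * x for x in samples[n * frame_len:(n + 1) * frame_len])
        for n in range(n_frames)
    ]
```

For example, with `frame_len=2` the samples `[1, 2, 3, 4, 5]` form two complete frames, `[1, 2]` and `[3, 4]`, and the final sample is ignored.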

In addition, microprocessor 110 numbers each frame sequentially from 1 through
the total number of frames, N. Although the frames may be numbered with the flow 
(left to right) or against the flow (right to left) of the voice waveform, the frames are 
preferably numbered with the flow of the waveform. Consequently, each frame has a 
frame number, n, corresponding to the position of the frame in the speech acquisition 
window. 

Microprocessor 110 has a speech/noise classifier for determining whether each 
frame is speech or noise. Any speech/noise classifier may be used. However, the 
performance of the present invention improves as the accuracy of the classifier 
increases. If the classifier identifies a frame as speech, the classifier assigns the frame 
an SNflag of 1. If the classifier identifies a frame as noise, the classifier assigns the 
frame an SNflag of 0. SNflag is a control value used to classify the frames. 
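
Because the description permits any speech/noise classifier, the following is only a placeholder sketch that thresholds frame energy against an assumed noise floor. The function name `classify_frames` and the parameters `noise_floor` and `ratio` are hypothetical and do not come from the source; a practical device would likely use a more accurate classifier.

```python
def classify_frames(fegy, noise_floor, ratio=3.0):
    """Assign an SNflag to each frame: 1 for speech, 0 for noise.

    Assumption: a frame counts as speech when its energy exceeds
    ratio * noise_floor. This simple rule is a stand-in only; the
    source does not specify the classifier.
    """
    threshold = noise_floor * ratio
    return [1 if energy > threshold else 0 for energy in fegy]
```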

Microprocessor 110 then determines additional speech waveform parameters of
the speech signal according to the following equations: 



The normalized frame energy, Nfegy_n, is the frame energy adjusted for noise.
The bias frame energy, Bfegy, is an estimate of noise energy. It may be a theoretical or 
empirical number. It may also be measured, such as the noise in the first few frames of 
the speech acquisition window. 
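
The equations for these parameters are not reproduced in this text, so the sketch below rests on stated assumptions rather than the source formulas: the normalized energy is taken as the frame energy minus Bfegy, floored at zero and set to zero for noise frames; the cumulative frame energy is a running sum; the total window energy is the final running sum; the energy centroid icom is the energy-weighted mean frame number; and epkindx is the frame number of the largest normalized energy. Frame numbers are 1-based, as in the description.

```python
def waveform_parameters(fegy, snflag, bfegy):
    """Compute assumed speech waveform parameters for one window.

    fegy   -- frame energies, one per frame
    snflag -- 1 (speech) or 0 (noise) per frame
    bfegy  -- bias (noise) energy estimate, Bfegy

    Returns (nfegy, cumulative, total, icom, epkindx) with 1-based
    icom and epkindx. All formulas here are assumptions.
    """
    # Noise-adjusted energy; noise frames contribute nothing (assumed).
    nfegy = [max(e - bfegy, 0.0) * flag for e, flag in zip(fegy, snflag)]

    # Cumulative frame energy up to and including each frame.
    cumulative = []
    running = 0.0
    for e in nfegy:
        running += e
        cumulative.append(running)
    total = running
    if total == 0:
        raise ValueError("window contains no speech energy")

    # Energy centroid: energy-weighted mean of the frame numbers.
    icom = round(sum(n * e for n, e in enumerate(nfegy, start=1)) / total)
    # Peak energy index: frame number with the largest normalized energy.
    epkindx = max(range(1, len(nfegy) + 1), key=lambda n: nfegy[n - 1])
    return nfegy, cumulative, total, icom, epkindx
```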
In steps 220 through 235, microprocessor 110 determines whether the
calculated energy centroid is within a speech region of the utterance. If a certain 
percent of frames before or after the energy centroid are noise frames, the energy 
centroid may not be within a speech region of the utterance. In this situation, 
microprocessor 110 will use the index of the peak energy as the starting point to
determine the endpoints. The peak energy is usually expected to be within a speech 
region of the utterance. While the percent of noise frames surrounding the energy 
centroid has been chosen as the determining factor, it is understood that the percent of 
speech frames may be used as an alternative. 

In step 220, microprocessor 110 determines whether the percent of noise frames
in M1 frames preceding the energy centroid is greater than or equal to Valid1. While
M1 may be any number of frames, M1 is preferably in the range of 5 to 20 frames.
Valid1 is the percent of noise frames preceding the centroid that indicates the energy
centroid is not within a speech region. While Valid1 could be any percent including 100
percent, Valid1 is preferably in the range of 70 to 100 percent. If the percent of noise
frames in M1 frames preceding the energy centroid is greater than or equal to Valid1,
then the frame index is set to be equal to the peak energy index, epkindx, in step 235.
If the percent of noise frames in M1 frames preceding the energy centroid is less than
Valid1, then the method proceeds to step 225.

In step 225, microprocessor 110 determines whether the percent of noise frames
in M2 frames following the energy centroid is greater than or equal to Valid2. While M2
may be any number of frames, M2 is preferably in the range of 5 to 20 frames. Valid2
is the percent of noise frames following the centroid that indicates the energy centroid
is not within a speech region. While Valid2 could be any percent including 100 percent,
Valid2 is preferably in the range of 70 to 100 percent. If the percent of noise frames in
M2 frames following the energy centroid is greater than or equal to Valid2, then the
frame index is set to be equal to the peak energy index, epkindx, in step 235. If the
percent of noise frames in M2 frames following the energy centroid is less than Valid2,
then the frame index is set in step 230 to be equal to the index of the energy centroid,
icom. With the frame index set in either step 230 or 235, the method proceeds to step
240.
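
Steps 220 through 235 can be sketched as below. The defaults for M1, M2, Valid1, and Valid2 are picked from the preferred ranges only for illustration, and treating an empty range (a centroid at the very edge of the window) as 100 percent noise is an assumption not stated in the source.

```python
def noise_percent(snflag, first, last):
    """Percent of noise frames among 1-based frames first..last.

    The range is clipped to the window. An empty range is treated
    as 100 percent noise (an assumption, not from the source).
    """
    first = max(first, 1)
    last = min(last, len(snflag))
    if first > last:
        return 100.0
    frames = snflag[first - 1:last]
    return 100.0 * frames.count(0) / len(frames)


def select_frame_index(snflag, icom, epkindx,
                       m1=10, m2=10, valid1=80.0, valid2=80.0):
    """Choose the starting frame index for the endpoint search."""
    # Step 220: mostly noise just before the centroid?
    if noise_percent(snflag, icom - m1, icom - 1) >= valid1:
        return epkindx  # step 235: fall back to the peak energy index
    # Step 225: mostly noise just after the centroid?
    if noise_percent(snflag, icom + 1, icom + m2) >= valid2:
        return epkindx  # step 235
    return icom  # step 230: the centroid lies within a speech region
```

When an utterance consists of two bursts separated by a long pause, the centroid can land inside the pause; the check then falls back to the peak energy index, which sits inside one of the bursts.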
are present. EMINP is a minimum percent of the total window energy. While EMINP
may be any percent including 0 percent, EMINP is preferably within the range of 5 to 15
percent. If the cumulative energy at STRPNT is greater than EMINP of the total window
energy, then STRPNT is not an endpoint. The method proceeds to step 250, where
microprocessor 110 decrements STRPNT by X frames. The method then continues to
step 245.

If the cumulative energy at STRPNT is less than or equal to EMINP of the total
window energy, then the current value of STRPNT is the start endpoint. The method
proceeds to step 260, where the speech start index is set equal to the current value of
STRPNT. The method continues to step 265 for microprocessor 110 to determine the
end endpoint.
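
The start-endpoint search can be sketched as follows, with several assumptions because the text describing the earlier steps is not available here. The parameters `m_before` and `test_pct` (the number of preceding frames examined and the noise percent that marks a potential endpoint) are hypothetical names chosen by analogy with M4 and Test2; stepping back one frame when the noise test fails is also assumed. Only the EMINP comparison and the decrement by X frames come directly from the description.

```python
def find_start_endpoint(snflag, cumulative, total, frame_index,
                        m_before=5, test_pct=80.0, eminp=10.0, x=1):
    """Search backward from frame_index for the start endpoint (1-based).

    snflag     -- 1 (speech) or 0 (noise) per frame
    cumulative -- cumulative frame energy per frame
    total      -- total window energy
    """
    strpnt = frame_index
    while strpnt > 1:
        first = max(strpnt - m_before, 1)
        preceding = snflag[first - 1:strpnt - 1]  # frames first..strpnt-1
        noise_pct = 100.0 * preceding.count(0) / len(preceding)
        if noise_pct < test_pct:
            strpnt -= 1  # assumed: not a potential endpoint, step back
            continue
        # Potential endpoint found: validate it against EMINP.
        if cumulative[strpnt - 1] > eminp / 100.0 * total:
            # Too much energy lies before this frame, so earlier speech
            # remains; decrement by X frames (step 250) and keep looking.
            strpnt = max(strpnt - x, 1)
            continue
        break  # step 260: STRPNT is the start endpoint
    return strpnt
```

In a window with a short burst, a six-frame pause, and a second burst, the energy check keeps the search moving through the pause. Setting `eminp=100.0` effectively disables the check, and the search then stops at the pause.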

In steps 265 through 285, microprocessor 110 determines the end endpoint of
the speech utterance. Microprocessor 110 begins at the Frame Index, basically at a
position within the speech region of the utterance, and analyzes the frames following
the Frame Index to identify a potential end endpoint. When a potential end endpoint is
identified, microprocessor 110 checks whether the cumulative frame energy at the
potential end endpoint is greater than or equal to a percent of the total window energy.
If the potential end endpoint is the end endpoint of the utterance, the cumulative frame
energy at that frame should be almost all if not all of the total window energy. The
cumulative frame energy at such frame indicates whether additional speech frames are
present. In this manner, gaps and pauses in the utterance will not result in an erroneous
end endpoint determination.

In step 265, microprocessor 110 sets ENDPNT equal to the Frame Index.
ENDPNT is the frame being tested as the end endpoint. While ENDPNT is equal to the
Frame Index initially, microprocessor 110 will increment ENDPNT until the end endpoint
is found.

In step 270, microprocessor 110 determines whether the percent of noise frames
in M4 frames following ENDPNT is greater than or equal to Test2. While M4 can be
any number of frames, M4 is preferably in the range of 5 to 20 frames. Test2 is the
percent of noise frames indicating ENDPNT is an endpoint. While Test2 could be any
percent including 100 percent, Test2 is preferably in the range of 70 to 100 percent.
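
A sketch of the end-endpoint search is given below. The name `emax_pct` for the percent of total window energy that the cumulative energy must reach is hypothetical, since the text naming that threshold is not available here, and the one-frame increment `step` is likewise an assumption. M4 and Test2 follow the description of step 270.

```python
def find_end_endpoint(snflag, cumulative, total, frame_index,
                      m4=5, test2=80.0, emax_pct=90.0, step=1):
    """Search forward from frame_index for the end endpoint (1-based).

    snflag     -- 1 (speech) or 0 (noise) per frame
    cumulative -- cumulative frame energy per frame
    total      -- total window energy
    """
    n_frames = len(snflag)
    endpnt = frame_index  # step 265
    while endpnt < n_frames:
        last = min(endpnt + m4, n_frames)
        following = snflag[endpnt:last]  # frames endpnt+1..last
        noise_pct = 100.0 * following.count(0) / len(following)
        # Step 270 plus the energy validation: enough noise follows AND
        # nearly all of the window energy has already been accumulated.
        if (noise_pct >= test2
                and cumulative[endpnt - 1] >= emax_pct / 100.0 * total):
            break
        endpnt = min(endpnt + step, n_frames)  # assumed increment
    return endpnt
```

Starting inside the first burst of a two-burst utterance, the noise test alone would stop at the pause; the energy validation keeps the search going until the second burst has been passed.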
CLAIMS



1. A communication device capable of endpointing speech utterances, comprising:

at least one microprocessor having a speech/noise classifier, 

wherein the at least one microprocessor analyzes a speech signal to 
determine speech waveform parameters within a speech acquisition 
window, wherein the speech waveform parameters include a cumulative 
frame energy, an energy centroid of the speech waveform, and a total 
window energy, 

wherein the at least one microprocessor identifies a potential endpoint by 
analyzing frames in the speech acquisition window in relation to the 
energy centroid, and 
wherein the at least one microprocessor validates the potential endpoint is 
an endpoint by comparing the cumulative frame energy at the potential 
endpoint to the total window energy; 
a microphone for providing the speech signal to the at least one microprocessor; 

at least one communication output mechanism. 
6. A method for endpointing speech utterances, wherein the speech utterances
have a start endpoint and an end endpoint, comprising the steps of: 

(a) analyzing a speech signal to determine speech waveform parameters 
within a speech acquisition window, wherein the speech waveform parameters include
a cumulative frame energy, an energy centroid of the speech waveform, and a total
window energy; 

(b) identifying a potential start endpoint by analyzing at least one of noise and 
speech in frames in the speech acquisition window that precede the energy centroid; 
and 

(c) validating the potential start endpoint is the start endpoint by comparing
the cumulative frame energy at the potential start endpoint to the total window energy.